Automatic Identification of Closely-related Indian Languages: Resources and Experiments
Abstract
In this paper, we discuss an attempt to develop an automatic language identification system for five closely related Indo-Aryan languages of India: Awadhi, Bhojpuri, Braj, Hindi and Magahi. We have compiled comparable corpora of varying length for these languages from various resources, and we discuss the method of creation of these corpora in detail. Using these corpora, we developed a language identification system, which currently gives state-of-the-art accuracy of 96.48%. We also used these corpora to study the similarity between the five languages at the lexical level, which is the first data-based study of the extent of 'closeness' of these languages.
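As an illustration of what a lexical-level similarity comparison between corpora can look like, the following is a minimal sketch that computes pairwise Jaccard overlap between vocabularies. The abstract does not state which measure the authors used, so the choice of Jaccard overlap, the whitespace tokenization, the function names and the placeholder data are all assumptions made for this example.

```python
from itertools import combinations


def vocabulary(sentences):
    """Return the set of lowercased, whitespace-separated tokens."""
    return {token.lower() for sentence in sentences for token in sentence.split()}


def jaccard(vocab_a, vocab_b):
    """Jaccard overlap |A & B| / |A | B|; 0.0 when both sets are empty."""
    union = vocab_a | vocab_b
    if not union:
        return 0.0
    return len(vocab_a & vocab_b) / len(union)


def pairwise_similarity(corpora):
    """Map each unordered language pair to the Jaccard overlap of its vocabularies."""
    vocabs = {lang: vocabulary(sents) for lang, sents in corpora.items()}
    return {
        (a, b): jaccard(vocabs[a], vocabs[b])
        for a, b in combinations(sorted(vocabs), 2)
    }


# Hypothetical placeholder data, not text from the corpora described above.
toy_corpora = {
    "lang_a": ["w1 w2 w3", "w2 w4"],
    "lang_b": ["w2 w3 w5"],
    "lang_c": ["w6 w7"],
}
scores = pairwise_similarity(toy_corpora)
for pair, score in scores.items():
    print(pair, round(score, 2))
```

With real data, each entry of `toy_corpora` would hold the sentences of one language; a higher score for a pair would then suggest a larger shared vocabulary, under the assumptions stated above.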
Similar resources
Identifying false friends between closely related languages
In this paper we present a corpus-based approach to automatic identification of false friends for Slovene and Croatian, a pair of closely related languages. By taking advantage of the lexical overlap between the two languages, we focus on measuring the difference in meaning between identically spelled words by using frequency and distributional information. We analyze the impact of corpora of d...
Improvement of generative adversarial networks for automatic text-to-image generation
This research concerns the use of deep learning tools and image processing technology in the automatic generation of images from text. Previous research has used a single sentence to produce images. In this research, a memory-based hierarchical model is presented that uses three different descriptions, given in the form of sentences, to produce and improve the image. The proposed ...
Multilingual Speech Recognition for Information Retrieval in Indian Context
This paper analyzes various issues in building an HMM-based multilingual speech recognizer for Indian languages. The system was originally designed for Hindi and Tamil and adapted to incorporate Indian-accented English. Language-specific characteristics in the speech recognition framework are highlighted. The recognizer is embedded in information retrieval applications and hence several iss...
Addressing challenges in automatic Language Identification of Romanized Text
Due to the diversity of documents on the web, language identification is a vital task for web search engines during crawling and indexing of web documents. Among the current challenges in language identification, identifying the language of Romanized text remains an unsettled problem. The challenge in Romanized text is the variation in word spellings and sounds across dialects. We propose a Romani...
The GMM-VSM System in the Form of a GMM for Spoken Language Identification
GMM is one of the most successful models in the field of automatic language identification. In this paper we propose a new model named adapted weight GMM (AW-GMM). This model is similar to GMM, but the weights are determined using a GMM-VSM LID system, based on the power of each component in discriminating one language from the others. Also considering the computational complexity of GMM-VSM,...